Credit Card Users Churn Prediction
Problem Statement
Business Context
Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze customer data to identify the customers who will leave its credit card services and the reasons why, so that the bank can improve upon those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
Data Description
- CLIENTNUM: Client number. Unique identifier for the customer holding the account
- Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
- Customer_Age: Age in Years
- Gender: Gender of the account holder
- Dependent_count: Number of dependents
- Education_Level: Educational qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate
- Marital_Status: Marital Status of the account holder
- Income_Category: Annual Income Category of the account holder
- Card_Category: Type of Card
- Months_on_book: Period of relationship with the bank (in months)
- Total_Relationship_Count: Total no. of products held by the customer
- Months_Inactive_12_mon: No. of months inactive in the last 12 months
- Contacts_Count_12_mon: No. of Contacts in the last 12 months
- Credit_Limit: Credit Limit on the Credit Card
- Total_Revolving_Bal: Total Revolving Balance on the Credit Card
- Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
- Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
- Total_Trans_Amt: Total Transaction Amount (Last 12 months)
- Total_Trans_Ct: Total Transaction Count (Last 12 months)
- Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
- Avg_Utilization_Ratio: Average Card Utilization Ratio
What Is a Revolving Balance?
- If we don't pay the balance of a revolving credit account in full every month, the unpaid portion carries over to the next month. That is called a revolving balance.
What is the Average Open to Buy?
- 'Open to Buy' means the amount left on your credit card to use. This column represents the average of this value over the last 12 months.
What is the Average Utilization Ratio?
- The Avg_Utilization_Ratio represents how much of the available credit the customer has used. This is useful for calculating credit scores.
Relation between Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:
- ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
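This identity can be sanity-checked on toy numbers. The sketch below uses a small hypothetical frame named `demo` (so it does not clobber the notebook's `df`); note that in the real data Avg_Open_To_Buy is a 12-month average, so the identity holds only approximately there.

```python
import pandas as pd

# Hypothetical two-account frame with made-up limits and balances.
demo = pd.DataFrame({
    "Credit_Limit": [12691.0, 8256.0],
    "Total_Revolving_Bal": [777.0, 864.0],
})
demo["Avg_Open_To_Buy"] = demo["Credit_Limit"] - demo["Total_Revolving_Bal"]
demo["Avg_Utilization_Ratio"] = demo["Total_Revolving_Bal"] / demo["Credit_Limit"]

# (Avg_Open_To_Buy / Credit_Limit) + Avg_Utilization_Ratio should equal 1
check = demo["Avg_Open_To_Buy"] / demo["Credit_Limit"] + demo["Avg_Utilization_Ratio"]
print(check.round(6).tolist())  # [1.0, 1.0]
```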
Please read the instructions carefully before starting the project.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
- Blanks '_______' are provided in the notebook that need to be filled with appropriate code to get the correct result. With every '_______' blank, there is a comment that briefly describes what needs to be filled in.
- Identify the task to be performed correctly, and only then proceed to write the required code.
- Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
- Please run the cells sequentially from the beginning to avoid unnecessary errors.
- Add the results/observations (wherever mentioned) derived from the analysis in the presentation and submit the same.
Importing necessary libraries
from google.colab import drive, files
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
!pip install scikit-learn xgboost
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imblearn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl
# help read and manipulate data
import pandas as pd
import numpy as np
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# missing value imputation
from sklearn.impute import SimpleImputer
# model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
RandomForestClassifier,
AdaBoostClassifier,
GradientBoostingClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
# metric scores and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
)
# data scaling and one-hot encoding
from sklearn.preprocessing import StandardScaler, OneHotEncoder, minmax_scale
# tuning model
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
# smote
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import classification_report
# feature selection
from sklearn.feature_selection import RFE
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
import xgboost as xgb
from sklearn.tree import DecisionTreeClassifier
# To suppress warnings
import warnings
warnings.filterwarnings('ignore')
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
Loading the dataset
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/PROJECT 3/BankChurners.csv')
df = data.copy()
Data Overview
- Observations
- Sanity checks
df.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
df.shape
(10127, 21)
10,127 records with 21 columns.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, and Card_Category are all objects; I will need to encode these variables as integers before modeling.
df.drop(columns=['CLIENTNUM'], inplace=True)
df.isnull().sum()
| 0 | |
|---|---|
| Attrition_Flag | 0 |
| Customer_Age | 0 |
| Gender | 0 |
| Dependent_count | 0 |
| Education_Level | 1519 |
| Marital_Status | 749 |
| Income_Category | 0 |
| Card_Category | 0 |
| Months_on_book | 0 |
| Total_Relationship_Count | 0 |
| Months_Inactive_12_mon | 0 |
| Contacts_Count_12_mon | 0 |
| Credit_Limit | 0 |
| Total_Revolving_Bal | 0 |
| Avg_Open_To_Buy | 0 |
| Total_Amt_Chng_Q4_Q1 | 0 |
| Total_Trans_Amt | 0 |
| Total_Trans_Ct | 0 |
| Total_Ct_Chng_Q4_Q1 | 0 |
| Avg_Utilization_Ratio | 0 |
It seems that Education_Level and Marital_Status have some null records. However, that doesn't mean the other columns are clean: some columns contain values that effectively indicate null. For example, on further inspection, Income_Category contains the placeholder 'abc', which should be treated as missing.
{column: list(data[column].unique()) for column in data.select_dtypes('object').columns}
{'Attrition_Flag': ['Existing Customer', 'Attrited Customer'],
'Gender': ['M', 'F'],
'Education_Level': ['High School',
'Graduate',
'Uneducated',
nan,
'College',
'Post-Graduate',
'Doctorate'],
'Marital_Status': ['Married', 'Single', nan, 'Divorced'],
'Income_Category': ['$60K - $80K',
'Less than $40K',
'$80K - $120K',
'$40K - $60K',
'$120K +',
'abc'],
'Card_Category': ['Blue', 'Gold', 'Silver', 'Platinum']}
Fill NaN with 'Unknown'
# Set any NaN value to "Unknown"
df['Education_Level'] = df['Education_Level'].fillna('Unknown')
df['Marital_Status'] = df['Marital_Status'].fillna('Unknown')
# Replace the placeholder 'abc' with 'Unknown'
df['Income_Category'] = df['Income_Category'].replace('abc', 'Unknown')
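A quick sanity check of this cleanup logic, sketched on a small hypothetical frame named `demo` (in the notebook, the same assertions could be run on `df` after the cell above):

```python
import pandas as pd

# Toy frame with the same kinds of gaps handled above: real NaNs plus the
# 'abc' placeholder in the income column.
demo = pd.DataFrame({
    "Education_Level": ["Graduate", None, "Doctorate"],
    "Marital_Status": [None, "Married", "Single"],
    "Income_Category": ["abc", "$120K +", "Less than $40K"],
})
demo["Education_Level"] = demo["Education_Level"].fillna("Unknown")
demo["Marital_Status"] = demo["Marital_Status"].fillna("Unknown")
demo["Income_Category"] = demo["Income_Category"].replace("abc", "Unknown")

# After cleanup, no NaN and no placeholder token should remain.
assert demo.isna().sum().sum() == 0
assert not demo["Income_Category"].eq("abc").any()
print("clean")  # prints: clean
```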
df['Education_Level'].unique()
array(['High School', 'Graduate', 'Uneducated', 'Unknown', 'College',
'Post-Graduate', 'Doctorate'], dtype=object)
df['Marital_Status'].unique()
array(['Married', 'Single', 'Unknown', 'Divorced'], dtype=object)
df['Income_Category'].unique()
array(['$60K - $80K', 'Less than $40K', '$80K - $120K', '$40K - $60K',
'$120K +', 'Unknown'], dtype=object)
Column Change
Change object data types to int.
I need to decide whether each column should be ordinal-encoded or one-hot encoded. Here are the columns with the object data type:
Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, Card_Category.
The only columns with an inherent order are Income_Category and Education_Level.
Income_Category is ordinal because each value describes an income range ('Less than $40K', '$40K - $60K', '$60K - $80K', '$80K - $120K', '$120K +'), with the placeholder 'abc' already replaced by 'Unknown' above.
Since Income_Category is a range, I will map 0 = Unknown, 1 = Less than $40K, 2 = $40K - $60K, and so on.
Education_Level is ordinal because there is a natural progression: Uneducated, High School, College, Graduate, Post-Graduate, Doctorate (plus Unknown).
I will begin mapping these two columns.
{column: list(df[column].unique()) for column in df.select_dtypes('object').columns}
{'Attrition_Flag': ['Existing Customer', 'Attrited Customer'],
'Gender': ['M', 'F'],
'Education_Level': ['High School',
'Graduate',
'Uneducated',
'Unknown',
'College',
'Post-Graduate',
'Doctorate'],
'Marital_Status': ['Married', 'Single', 'Unknown', 'Divorced'],
'Income_Category': ['$60K - $80K',
'Less than $40K',
'$80K - $120K',
'$40K - $60K',
'$120K +',
'Unknown'],
'Card_Category': ['Blue', 'Gold', 'Silver', 'Platinum']}
# iterate through the income categories and make a dictionary
numbers = [0, 1, 2, 3, 4, 5]
income_category = ['Unknown', 'Less than $40K', '$40K - $60K', '$60K - $80K', '$80K - $120K', '$120K +']
income_dict = dict(zip(income_category, numbers))
income_dict
{'Unknown': 0,
'Less than $40K': 1,
'$40K - $60K': 2,
'$60K - $80K': 3,
'$80K - $120K': 4,
'$120K +': 5}
# map the income_dict to the Income_Category column.
df['Income_Category'] = df['Income_Category'].map(income_dict)
df['Income_Category'].unique()
array([3, 1, 4, 2, 5, 0])
# Step 1: Replace any remaining NaN with 'Unknown' (already done above, kept as a safety net)
df['Education_Level'] = df['Education_Level'].apply(lambda x: 'Unknown' if pd.isna(x) else x)
# Step 2: Create the dictionary for mapping (0 = 'Unknown' up to 6; note that
# 'Graduate' receives the highest code here, above 'Doctorate')
education_order = ['Unknown', 'Uneducated', 'High School', 'College', 'Post-Graduate', 'Doctorate', 'Graduate']
numbers = [0, 1, 2, 3, 4, 5, 6]  # corresponding numeric codes for the education levels
education_dict = dict(zip(education_order, numbers))
# Step 3: Map the Education_Level column using the dictionary
df['Education_Level_Mapped'] = df['Education_Level'].map(education_dict)
df['Education_Level'] = df['Education_Level_Mapped']  # Education_Level_Mapped is kept as a duplicate column
df['Education_Level'].unique()
array([2, 6, 1, 0, 3, 4, 5])
Now that education has an ordering system and blank values have been replaced with 'Unknown' (code 0), I can move forward with one-hot encoding the rest of the object columns.
{column: list(df[column].unique()) for column in df.select_dtypes('object').columns}
{'Attrition_Flag': ['Existing Customer', 'Attrited Customer'],
'Gender': ['M', 'F'],
'Marital_Status': ['Married', 'Single', 'Unknown', 'Divorced'],
'Card_Category': ['Blue', 'Gold', 'Silver', 'Platinum']}
Now I must encode Attrition_Flag, Gender, Marital_Status, and Card_Category, since these have no inherent order.
# encode all object columns into dummies
df = pd.get_dummies(df, columns=['Attrition_Flag', 'Gender', 'Marital_Status', 'Card_Category'], drop_first=True)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer_Age 10127 non-null int64 1 Dependent_count 10127 non-null int64 2 Education_Level 10127 non-null int64 3 Income_Category 10127 non-null int64 4 Months_on_book 10127 non-null int64 5 Total_Relationship_Count 10127 non-null int64 6 Months_Inactive_12_mon 10127 non-null int64 7 Contacts_Count_12_mon 10127 non-null int64 8 Credit_Limit 10127 non-null float64 9 Total_Revolving_Bal 10127 non-null int64 10 Avg_Open_To_Buy 10127 non-null float64 11 Total_Amt_Chng_Q4_Q1 10127 non-null float64 12 Total_Trans_Amt 10127 non-null int64 13 Total_Trans_Ct 10127 non-null int64 14 Total_Ct_Chng_Q4_Q1 10127 non-null float64 15 Avg_Utilization_Ratio 10127 non-null float64 16 Education_Level_Mapped 10127 non-null int64 17 Attrition_Flag_Existing Customer 10127 non-null bool 18 Gender_M 10127 non-null bool 19 Marital_Status_Married 10127 non-null bool 20 Marital_Status_Single 10127 non-null bool 21 Marital_Status_Unknown 10127 non-null bool 22 Card_Category_Gold 10127 non-null bool 23 Card_Category_Platinum 10127 non-null bool 24 Card_Category_Silver 10127 non-null bool dtypes: bool(8), float64(5), int64(12) memory usage: 1.4 MB
Attrition_Flag_Existing Customer is a binary column indicating whether a customer is an "Existing Customer" (1) or an "Attrited Customer" (0).
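Since the attrited class is the event we ultimately want to predict, one might prefer a positive "churn" label. A minimal sketch of flipping the dummy (the frame `demo` and the column name 'Churn' are hypothetical, not part of the notebook's pipeline):

```python
import pandas as pd

# Toy frame mimicking the dummy column produced by get_dummies above.
demo = pd.DataFrame({"Attrition_Flag_Existing Customer": [1, 0, 1, 0]})

# Flip the dummy so that 1 marks an attrited (churned) customer.
demo["Churn"] = 1 - demo["Attrition_Flag_Existing Customer"]
print(demo["Churn"].tolist())  # [0, 1, 0, 1]
```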
# change bool values to 0's and 1's
for column in df.columns:
if df[column].dtype == 'bool':
df[column] = df[column].astype(int)
df.head()
| Customer_Age | Dependent_count | Education_Level | Income_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Education_Level_Mapped | Attrition_Flag_Existing Customer | Gender_M | Marital_Status_Married | Marital_Status_Single | Marital_Status_Unknown | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | 3 | 2 | 3 | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 49 | 5 | 6 | 1 | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 | 6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 51 | 3 | 6 | 4 | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 | 6 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 40 | 4 | 2 | 1 | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 40 | 3 | 1 | 3 | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
All columns are now either int or float. Let's look at the nulls again.
df.isnull().sum()
| 0 | |
|---|---|
| Customer_Age | 0 |
| Dependent_count | 0 |
| Education_Level | 0 |
| Income_Category | 0 |
| Months_on_book | 0 |
| Total_Relationship_Count | 0 |
| Months_Inactive_12_mon | 0 |
| Contacts_Count_12_mon | 0 |
| Credit_Limit | 0 |
| Total_Revolving_Bal | 0 |
| Avg_Open_To_Buy | 0 |
| Total_Amt_Chng_Q4_Q1 | 0 |
| Total_Trans_Amt | 0 |
| Total_Trans_Ct | 0 |
| Total_Ct_Chng_Q4_Q1 | 0 |
| Avg_Utilization_Ratio | 0 |
| Education_Level_Mapped | 0 |
| Attrition_Flag_Existing Customer | 0 |
| Gender_M | 0 |
| Marital_Status_Married | 0 |
| Marital_Status_Single | 0 |
| Marital_Status_Unknown | 0 |
| Card_Category_Gold | 0 |
| Card_Category_Platinum | 0 |
| Card_Category_Silver | 0 |
Everything looks good. There are no nulls.
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Education_Level | 10127.000 | 3.124 | 2.265 | 0.000 | 1.000 | 3.000 | 6.000 | 6.000 |
| Income_Category | 10127.000 | 2.086 | 1.475 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
| Education_Level_Mapped | 10127.000 | 3.124 | 2.265 | 0.000 | 1.000 | 3.000 | 6.000 | 6.000 |
| Attrition_Flag_Existing Customer | 10127.000 | 0.839 | 0.367 | 0.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Gender_M | 10127.000 | 0.471 | 0.499 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| Marital_Status_Married | 10127.000 | 0.463 | 0.499 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| Marital_Status_Single | 10127.000 | 0.389 | 0.488 | 0.000 | 0.000 | 0.000 | 1.000 | 1.000 |
| Marital_Status_Unknown | 10127.000 | 0.074 | 0.262 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Card_Category_Gold | 10127.000 | 0.011 | 0.106 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Card_Category_Platinum | 10127.000 | 0.002 | 0.044 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
| Card_Category_Silver | 10127.000 | 0.055 | 0.228 | 0.000 | 0.000 | 0.000 | 0.000 | 1.000 |
Average Age: 46,
Average Education Level code: 3.12 (≈ 'College' under the mapping),
Average Dependent Count: 2,
Average Income Category code: 2.09 (≈ '$40K - $60K'),
Average Credit Limit: $8,631.95,
Average Months on Book: 35.93,
Average Total Relationship Count: 3.81,
Average Total Transaction Amount: $4,404.09,
Average Total Transaction Count: 64.86,
Average Utilization Ratio: 27.49%.
The encoded columns appear in this summary as numeric codes, so their means should be read against the mappings defined earlier.
Let's explore these summary statistics more in the EDA.
Exploratory Data Analysis (EDA)
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- How is the total transaction amount distributed?
- What is the distribution of the level of education of customers?
- What is the distribution of the level of income of customers?
- How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
- How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
- What are the attributes that have a strong correlation with each other?
Answers:
The total transaction amount is right-skewed, which means the outliers are on the right side of the boxplot. The mean is around $4,404.
Education is left-skewed, and the mean code is about 3 ('College' under the mapping).
Income has a median code of 2 ('$40K - $60K'); the most common value is 1 ('Less than $40K'), and the distribution is slightly right-skewed.
Mean Total_Ct_Chng_Q4_Q1 is 0.554 for attrited customers and 0.742 for existing customers. On average, attrited customers' transaction counts grew less between Q4 and Q1 than those of existing customers; the difference is -0.188.
Customers flagged as attrited had slightly more months of inactivity on average (2.69) than existing customers (2.27).
Avg_Open_To_Buy and Credit_Limit are the most strongly correlated pair, followed by Customer_Age and Months_on_book, and then Gender_M and Income_Category.
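These correlation claims can be checked by ranking absolute pairwise correlations. The sketch below runs on a small synthetic frame (`toy`, with made-up values); in the notebook, `df.corr()` on the encoded dataframe works the same way:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the encoded df: open-to-buy is the limit minus a
# small random balance, so the two columns are almost perfectly correlated.
rng = np.random.default_rng(0)
limit = rng.uniform(1500, 35000, 200)
toy = pd.DataFrame({
    "Credit_Limit": limit,
    "Avg_Open_To_Buy": limit - rng.uniform(0, 1500, 200),
    "Total_Trans_Ct": rng.integers(10, 140, 200).astype(float),
})

# Unstack the correlation matrix, drop the self-correlations (exactly 1.0),
# and rank the remaining pairs by absolute correlation.
pairs = toy.corr().abs().unstack()
pairs = pairs[pairs < 1.0].sort_values(ascending=False)
top_pair = pairs.index[0]
print(top_pair)  # the Credit_Limit / Avg_Open_To_Buy pair (in either order)
```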
The below functions need to be defined to carry out the Exploratory Data Analysis.
Data Visualization Code
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
if bins:
sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins) # histogram with the requested bins
else:
sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2) # histogram with default bins
# (the palette argument was dropped: it has no effect without hue)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# Adjust layout to avoid overlap
plt.tight_layout()
plt.show()
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.tight_layout()
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
Overall observations (univariate and multivariate with the dependent variable)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Customer_Age 10127 non-null int64 1 Dependent_count 10127 non-null int64 2 Education_Level 10127 non-null int64 3 Income_Category 10127 non-null int64 4 Months_on_book 10127 non-null int64 5 Total_Relationship_Count 10127 non-null int64 6 Months_Inactive_12_mon 10127 non-null int64 7 Contacts_Count_12_mon 10127 non-null int64 8 Credit_Limit 10127 non-null float64 9 Total_Revolving_Bal 10127 non-null int64 10 Avg_Open_To_Buy 10127 non-null float64 11 Total_Amt_Chng_Q4_Q1 10127 non-null float64 12 Total_Trans_Amt 10127 non-null int64 13 Total_Trans_Ct 10127 non-null int64 14 Total_Ct_Chng_Q4_Q1 10127 non-null float64 15 Avg_Utilization_Ratio 10127 non-null float64 16 Education_Level_Mapped 10127 non-null int64 17 Attrition_Flag_Existing Customer 10127 non-null int64 18 Gender_M 10127 non-null int64 19 Marital_Status_Married 10127 non-null int64 20 Marital_Status_Single 10127 non-null int64 21 Marital_Status_Unknown 10127 non-null int64 22 Card_Category_Gold 10127 non-null int64 23 Card_Category_Platinum 10127 non-null int64 24 Card_Category_Silver 10127 non-null int64 dtypes: float64(5), int64(20) memory usage: 1.9 MB
Customer Age
# lets analyze customer's age
histogram_boxplot(df, 'Customer_Age')
The median age is 46.
# create a stacked countplot of age and existing customer
stacked_barplot(df, 'Customer_Age', 'Attrition_Flag_Existing Customer')
Attrition_Flag_Existing Customer 0 1 All Customer_Age All 1627 8500 10127 43 85 388 473 48 85 387 472 44 84 416 500 46 82 408 490 45 79 407 486 49 79 416 495 47 76 403 479 41 76 303 379 50 71 381 452 54 69 238 307 40 64 297 361 42 62 364 426 53 59 328 387 52 58 318 376 51 58 340 398 55 51 228 279 39 48 285 333 38 47 256 303 56 43 219 262 59 40 117 157 37 37 223 260 57 33 190 223 58 24 133 157 36 24 197 221 35 21 163 184 33 20 107 127 34 19 127 146 32 17 89 106 61 17 76 93 62 17 76 93 30 15 55 70 31 13 78 91 60 13 114 127 65 9 92 101 63 8 57 65 29 7 49 56 26 6 72 78 64 5 38 43 27 3 29 32 28 1 28 29 66 1 1 2 68 1 1 2 67 0 4 4 70 0 1 1 73 0 1 1 ------------------------------------------------------------------------------------------------------------------------
Most attrited customers are on the younger side of the distribution (ages 43-54), while existing customers are concentrated in the 44-54 range.
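One way to condense the stacked bar chart above into a table is an attrition rate per age band. The sketch below uses synthetic data (`toy` and the band edges are assumptions); in the notebook it would run directly on `df`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: ages 26-73 and a dummy that is 1 for existing
# customers, roughly matching the ~16% attrition share in the data.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "Customer_Age": rng.integers(26, 74, 500),
    "Attrition_Flag_Existing Customer": (rng.random(500) > 0.16).astype(int),
})

# Bucket ages and compute the attrition rate (share of attrited customers)
# in each bucket.
bands = pd.cut(toy["Customer_Age"], bins=[25, 35, 45, 55, 65, 75])
rate = 1 - toy.groupby(bands)["Attrition_Flag_Existing Customer"].mean()
print(rate)
```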
Dependent Count
histogram_boxplot(df, 'Dependent_count')
# create a stacked countplot of dependent count and existing customer
stacked_barplot(df, 'Dependent_count', 'Attrition_Flag_Existing Customer')
Attrition_Flag_Existing Customer 0 1 All Dependent_count All 1627 8500 10127 3 482 2250 2732 2 417 2238 2655 1 269 1569 1838 4 260 1314 1574 0 135 769 904 5 64 360 424 ------------------------------------------------------------------------------------------------------------------------
Not much to report on.
Months_on_book
# create boxplot of Months_on_book
histogram_boxplot(df, 'Months_on_book')
df['Months_on_book'].mean()
35.928409203120374
The mean months on book is about 35.9.
# stacked bar plot between existing customers and attrited customers and the months
stacked_barplot(df, 'Months_on_book', 'Attrition_Flag_Existing Customer')
| Months_on_book | Attrited (0) | Existing (1) | All |
|---|---|---|---|
| All | 1627 | 8500 | 10127 |
| 36 | 430 | 2033 | 2463 |
| 39 | 64 | 277 | 341 |
| 37 | 62 | 296 | 358 |
| 30 | 58 | 242 | 300 |
| 38 | 57 | 290 | 347 |
| 34 | 57 | 296 | 353 |
| 41 | 51 | 246 | 297 |
| 33 | 48 | 257 | 305 |
| 40 | 45 | 288 | 333 |
| 35 | 45 | 272 | 317 |
| 32 | 44 | 245 | 289 |
| 28 | 43 | 232 | 275 |
| 44 | 42 | 188 | 230 |
| 43 | 42 | 231 | 273 |
| 46 | 36 | 161 | 197 |
| 42 | 36 | 235 | 271 |
| 29 | 34 | 207 | 241 |
| 31 | 34 | 284 | 318 |
| 45 | 33 | 194 | 227 |
| 25 | 31 | 134 | 165 |
| 24 | 28 | 132 | 160 |
| 48 | 27 | 135 | 162 |
| 50 | 25 | 71 | 96 |
| 49 | 24 | 117 | 141 |
| 26 | 24 | 162 | 186 |
| 47 | 24 | 147 | 171 |
| 27 | 23 | 183 | 206 |
| 22 | 20 | 85 | 105 |
| 56 | 17 | 86 | 103 |
| 51 | 16 | 64 | 80 |
| 18 | 13 | 45 | 58 |
| 20 | 13 | 61 | 74 |
| 52 | 12 | 50 | 62 |
| 23 | 12 | 104 | 116 |
| 21 | 10 | 73 | 83 |
| 15 | 9 | 25 | 34 |
| 53 | 7 | 71 | 78 |
| 13 | 7 | 63 | 70 |
| 19 | 6 | 57 | 63 |
| 54 | 6 | 47 | 53 |
| 17 | 4 | 35 | 39 |
| 55 | 4 | 38 | 42 |
| 16 | 3 | 26 | 29 |
| 14 | 1 | 15 | 16 |
Attrited customers are relatively more common at lower months-on-book values, while the proportion of existing customers grows as tenure increases.
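To put a number on that tenure effect, the attrition rate can be computed per months-on-book bucket. This is a sketch on hypothetical toy data; the bucket edges are my own choice, not something from the notebook:

```python
import pandas as pd

def attrition_rate_by_tenure(data, bins=(12, 24, 36, 48, 60)):
    """Share of attrited customers (flag == 0) per months-on-book bucket."""
    bucket = pd.cut(data['Months_on_book'], bins=bins)
    attrited = 1 - data['Attrition_Flag_Existing Customer']
    return attrited.groupby(bucket, observed=True).mean()

# Toy data: short-tenure customers attrite more, mirroring the pattern above
toy = pd.DataFrame({
    'Months_on_book': [15, 18, 30, 33, 50, 55],
    'Attrition_Flag_Existing Customer': [0, 0, 0, 1, 1, 1],
})
rates = attrition_rate_by_tenure(toy)
print(rates)
```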
Total Relationship Count
# Total_Relationship_Count: Total no. of products held by the customer
# create a histogram of total relationship count
histogram_boxplot(df, 'Total_Relationship_Count')
The typical total relationship count is around 4 products.
# stacked plot between total relationship count v existing customer
stacked_barplot(df, 'Total_Relationship_Count', 'Attrition_Flag_Existing Customer')
| Total_Relationship_Count | Attrited (0) | Existing (1) | All |
|---|---|---|---|
| All | 1627 | 8500 | 10127 |
| 3 | 400 | 1905 | 2305 |
| 2 | 346 | 897 | 1243 |
| 1 | 233 | 677 | 910 |
| 5 | 227 | 1664 | 1891 |
| 4 | 225 | 1687 | 1912 |
| 6 | 196 | 1670 | 1866 |
Relationship counts of one and two show the highest proportions of attrited customers, suggesting that customers who leave tend to hold fewer products with the bank.
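That proportion claim can be verified directly from the counts printed above (numbers copied from the output):

```python
import pandas as pd

# (attrited, total) per Total_Relationship_Count, copied from the output above
counts = pd.DataFrame(
    {'attrited': [233, 346, 400, 225, 227, 196],
     'total':    [910, 1243, 2305, 1912, 1891, 1866]},
    index=[1, 2, 3, 4, 5, 6],
)
counts['attrition_rate'] = counts['attrited'] / counts['total']
print(counts['attrition_rate'].round(3).sort_values(ascending=False))
```

Customers holding two products attrite at roughly 28%, versus about 11–12% for those holding four or more.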
Contacts_Count_12_mon
No. of Contacts in the last 12 months
# histplot of Contacts_Count_12_mon
histogram_boxplot(df, 'Contacts_Count_12_mon')
The median is around 2.5 contacts.
# countplot of Contacts_Count_12_mon by attrition status
stacked_barplot(df, 'Contacts_Count_12_mon', 'Attrition_Flag_Existing Customer')
| Contacts_Count_12_mon | Attrited (0) | Existing (1) | All |
|---|---|---|---|
| All | 1627 | 8500 | 10127 |
| 3 | 681 | 2699 | 3380 |
| 2 | 403 | 2824 | 3227 |
| 4 | 315 | 1077 | 1392 |
| 1 | 108 | 1391 | 1499 |
| 5 | 59 | 117 | 176 |
| 6 | 54 | 0 | 54 |
| 0 | 7 | 392 | 399 |
The proportion of attrited customers rises with the number of contacts; notably, every one of the 54 customers with six contacts in the last 12 months attrited.
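Computing the rate per contact count from the table above makes the trend explicit (numbers copied from the output):

```python
import pandas as pd

# attrited / total per Contacts_Count_12_mon, copied from the output above
attrited = pd.Series({0: 7, 1: 108, 2: 403, 3: 681, 4: 315, 5: 59, 6: 54})
total = pd.Series({0: 399, 1: 1499, 2: 3227, 3: 3380, 4: 1392, 5: 176, 6: 54})
rate = (attrited / total).round(3)
print(rate)
```

The attrition rate increases monotonically with contact count, reaching 1.0 at six contacts.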
Credit Limit
# credit limit histogram
histogram_boxplot(df, 'Credit_Limit')
The credit-limit distribution is right-skewed; the median is around 5,000.
# dist plot and credit limit v existing
distribution_plot_wrt_target(df, 'Credit_Limit', 'Attrition_Flag_Existing Customer')
Both groups share the same right-skewed shape; existing customers have more high-end outliers than attrited customers.
Total_Revolving_Bal
Total_Revolving_Bal: Total Revolving Balance on the Credit Card
# histogram total revolving balance
histogram_boxplot(df, 'Total_Revolving_Bal')
# dist plot and credit limit v existing
distribution_plot_wrt_target(df, 'Total_Revolving_Bal', 'Attrition_Flag_Existing Customer')
Existing customers tend to carry a higher revolving balance than attrited customers. Attrited customers are concentrated at either end: most hold roughly $0–500 in revolving balance, with a smaller cluster just below $2,500.
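A quick bucket count makes that bimodal claim checkable. This is a sketch on hypothetical toy data, and the bucket edges are my own choice:

```python
import pandas as pd

def balance_profile(data):
    """Distribution of attrited customers across revolving-balance buckets."""
    attrited = data[data['Attrition_Flag_Existing Customer'] == 0]
    buckets = pd.cut(attrited['Total_Revolving_Bal'],
                     bins=[-1, 500, 1500, 2600])  # -1 so a 0 balance is counted
    return buckets.value_counts(normalize=True).sort_index()

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    'Total_Revolving_Bal': [0, 0, 200, 1200, 2400, 2500, 900],
    'Attrition_Flag_Existing Customer': [0, 0, 0, 0, 0, 0, 1],
})
prof = balance_profile(toy)
print(prof)
```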
Avg_Open_To_Buy
Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
# hist plot of Avg_Open_To_Buy
histogram_boxplot(df, 'Avg_Open_To_Buy')
Right skewed
# dist plot and Avg_Open_To_Buy v existing
distribution_plot_wrt_target(df, 'Avg_Open_To_Buy', 'Attrition_Flag_Existing Customer')
Avg_Open_To_Buy shows the same right-skewed shape for attrited and existing customers; existing customers have more outliers.
Answer Q1 (Transaction Amount)
df.Total_Trans_Amt.describe()
| | Total_Trans_Amt |
|---|---|
| count | 10127.000 |
| mean | 4404.086 |
| std | 3397.129 |
| min | 510.000 |
| 25% | 2155.500 |
| 50% | 3899.000 |
| 75% | 4741.000 |
| max | 18484.000 |
histogram_boxplot(df, 'Total_Trans_Amt')
Total_Trans_Amt is right-skewed: the mean (4,404) exceeds the median (3,899), and the maximum (18,484) sits far above the 75th percentile (4,741).
Answer Q2 (Education Level)
histogram_boxplot(df, 'Education_Level')
distribution_plot_wrt_target(df, 'Education_Level', 'Attrition_Flag_Existing Customer')
Education has the same shape between attrited and existing customers.
Answer Q3 (Income)
df.Income_Category.value_counts()
| Income_Category | count |
|---|---|
| 1 | 3561 |
| 2 | 1790 |
| 4 | 1535 |
| 3 | 1402 |
| 0 | 1112 |
| 5 | 727 |
df.Income_Category.value_counts().plot(kind='bar')
<Axes: xlabel='Income_Category'>
df.Income_Category.mean()
2.0857114644020935
histogram_boxplot(df,'Income_Category')
labeled_barplot(df, 'Income_Category')
distribution_plot_wrt_target(df, 'Income_Category', 'Attrition_Flag_Existing Customer')
Income category 1 is by far the most common (3,561 of 10,127 customers); each of the remaining categories holds between roughly 700 and 1,800 customers.
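Relative frequencies are easier to compare than the raw counts; this sketch reuses the `value_counts()` numbers shown above:

```python
import pandas as pd

# Copied from df.Income_Category.value_counts() above
counts = pd.Series({1: 3561, 2: 1790, 4: 1535, 3: 1402, 0: 1112, 5: 727})
share = (counts / counts.sum()).round(3)
print(share)
```

Category 1 alone covers about 35% of all customers.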
Q4 (Total_Ct_Chng_Q4_Q1)
How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary with the customer's account status (Attrition_Flag)?
df.Total_Ct_Chng_Q4_Q1.describe()
| | Total_Ct_Chng_Q4_Q1 |
|---|---|
| count | 10127.000 |
| mean | 0.712 |
| std | 0.238 |
| min | 0.000 |
| 25% | 0.582 |
| 50% | 0.702 |
| 75% | 0.818 |
| max | 3.714 |
df['Attrition_Flag_Existing Customer'] = df['Attrition_Flag_Existing Customer'].astype(int)
Attrited_Customer = df[df['Attrition_Flag_Existing Customer'] == 0]
Existing_Customer = df[df['Attrition_Flag_Existing Customer'] == 1]
print(f"Attrited Customer: {len(Attrited_Customer)}")
print(f"Existing Customer: {len(Existing_Customer)}")
Attrited Customer: 1627 Existing Customer: 8500
mean_attrited = Attrited_Customer['Total_Ct_Chng_Q4_Q1'].mean()
mean_existing = Existing_Customer['Total_Ct_Chng_Q4_Q1'].mean()
print(f"Mean Total_Ct_Chng_Q4_Q1 for Attrited Customer: {mean_attrited}")
print(f"Mean Total_Ct_Chng_Q4_Q1 for Existing Customer: {mean_existing}")
Mean Total_Ct_Chng_Q4_Q1 for Attrited Customer: 0.5543859864781807 Mean Total_Ct_Chng_Q4_Q1 for Existing Customer: 0.7424341176470589
total_ct_change_Q4_Q1 = Attrited_Customer['Total_Ct_Chng_Q4_Q1'].mean() - Existing_Customer['Total_Ct_Chng_Q4_Q1'].mean()
total_ct_change_Q4_Q1
-0.1880481311688782
Attrited customers' mean Total_Ct_Chng_Q4_Q1 (0.554) is about 0.19 lower than existing customers' (0.742), i.e., attrited customers' transaction counts dropped more sharply between quarters.
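The same comparison can be done in one line with `groupby`, instead of slicing the frame twice; a sketch on toy data mirroring the columns above:

```python
import pandas as pd

def mean_by_status(data, col):
    """Mean of `col` for attrited (0) vs. existing (1) customers."""
    return data.groupby('Attrition_Flag_Existing Customer')[col].mean()

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    'Attrition_Flag_Existing Customer': [0, 0, 1, 1, 1],
    'Total_Ct_Chng_Q4_Q1': [0.5, 0.6, 0.7, 0.8, 0.75],
})
means = mean_by_status(toy, 'Total_Ct_Chng_Q4_Q1')
print(means[1] - means[0])  # gap between existing and attrited
```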
# Adjusted version of the earlier labeled_barplot: compares existing vs. attrited customers
def labeled_barplot(data, feature, target, perc=False, n=None):
    """
    Barplot of the mean of a numerical column per category, annotated at the top of each bar.
    data: dataframe
    feature: categorical column to group by (e.g., Customer_Type)
    target: numerical column to aggregate (e.g., Total_Ct_Chng_Q4_Q1)
    perc: display percentages of the column total instead of the mean (default False)
    n: display only the first n category levels (default None, i.e., all levels)
    """
    # Label the customer types from the one-hot attrition column so the x-axis is readable.
    # This adds a string column to the passed frame, so it is dropped again afterwards.
    data['Customer_Type'] = data['Attrition_Flag_Existing Customer'].apply(
        lambda x: 'Existing Customer' if x == 1 else 'Attrited Customer'
    )
    # Aggregate: mean of the target per category level
    grouped_data = data.groupby(feature)[target].mean().reset_index()
    if n is not None:
        grouped_data = grouped_data.head(n)
    plt.figure(figsize=(len(grouped_data) + 1, 5))
    plt.xticks(rotation=10, fontsize=12)
    ax = sns.barplot(data=grouped_data, x=feature, y=target, palette="Paired")
    # Annotate each bar with its value (percentage of the column total, or the mean)
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(100 * p.get_height() / data[target].sum())
        else:
            label = "{:.1f}".format(p.get_height())
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label, (x, y), ha="center", va="center", size=12,
            xytext=(0, 5), textcoords="offset points",
        )
    plt.title(f'{target} by Customer Type')
    plt.xlabel('Customer Type')
    plt.ylabel(f'Mean {target}')
    plt.tight_layout()
    plt.show()
# Example usage:
labeled_barplot(df, 'Customer_Type', 'Total_Ct_Chng_Q4_Q1', perc=False)
# df.drop('Customer_Type', axis=1, errors='ignore', inplace=True)
Q5 (Months_Inactive_12_mon)
How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
df['Months_Inactive_12_mon'].describe()
| | Months_Inactive_12_mon |
|---|---|
| count | 10127.000 |
| mean | 2.341 |
| std | 1.011 |
| min | 0.000 |
| 25% | 2.000 |
| 50% | 2.000 |
| 75% | 3.000 |
| max | 6.000 |
The mean number of inactive months across all customers is about 2.3. Let's look at each type of customer in more detail.
Attrited = 0, Existing = 1
# Inactive v. Attrition Customer
df.groupby(['Attrition_Flag_Existing Customer', 'Months_Inactive_12_mon'])['Months_Inactive_12_mon'].count()
| Attrition_Flag_Existing Customer | Months_Inactive_12_mon | count |
|---|---|---|
| 0 | 0 | 15 |
| 0 | 1 | 100 |
| 0 | 2 | 505 |
| 0 | 3 | 826 |
| 0 | 4 | 130 |
| 0 | 5 | 32 |
| 0 | 6 | 19 |
| 1 | 0 | 14 |
| 1 | 1 | 2133 |
| 1 | 2 | 2777 |
| 1 | 3 | 3020 |
| 1 | 4 | 305 |
| 1 | 5 | 146 |
| 1 | 6 | 105 |
# average number of months inactive when attrited
Attrited_Customer['Months_Inactive_12_mon'].mean()
2.693300553165335
# average number of months inactive when existing customer
Existing_Customer['Months_Inactive_12_mon'].mean()
2.273764705882353
labeled_barplot(df, 'Customer_Type', 'Months_Inactive_12_mon', perc=False)
On average, attrited customers have slightly more inactive months than existing customers (2.69 vs. 2.27).
df.drop('Customer_Type', axis=1, errors='ignore', inplace=True)
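As a sanity check, the two group means can also be recovered directly from the count table above (numbers copied from the output):

```python
import pandas as pd

# Counts per months-inactive value, copied from the groupby output above
attrited = pd.Series({0: 15, 1: 100, 2: 505, 3: 826, 4: 130, 5: 32, 6: 19})
existing = pd.Series({0: 14, 1: 2133, 2: 2777, 3: 3020, 4: 305, 5: 146, 6: 105})

def weighted_mean(counts):
    """Mean months inactive, computed from a value -> count table."""
    return (counts.index * counts).sum() / counts.sum()

print(round(weighted_mean(attrited), 3), round(weighted_mean(existing), 3))
```

The weighted means reproduce the 2.693 and 2.274 figures computed from the raw rows above.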
Q6
What are the attributes that have a strong correlation with each other?
Below are all of the pairwise correlations in the data frame.
# find correlation
df.corr()
| | Customer_Age | Dependent_count | Education_Level | Income_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | Education_Level_Mapped | Attrition_Flag_Existing Customer | Gender_M | Marital_Status_Married | Marital_Status_Single | Marital_Status_Unknown | Card_Category_Gold | Card_Category_Platinum | Card_Category_Silver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Customer_Age | 1.000 | -0.122 | -0.002 | 0.024 | 0.789 | -0.011 | 0.054 | -0.018 | 0.002 | 0.015 | 0.001 | -0.062 | -0.046 | -0.067 | -0.012 | 0.007 | -0.002 | -0.018 | -0.017 | 0.047 | -0.011 | -0.027 | -0.012 | 0.007 | -0.019 |
| Dependent_count | -0.122 | 1.000 | 0.000 | 0.066 | -0.103 | -0.039 | -0.011 | -0.041 | 0.068 | -0.003 | 0.068 | -0.035 | 0.025 | 0.050 | 0.011 | -0.037 | 0.000 | -0.019 | 0.005 | 0.014 | -0.041 | 0.042 | 0.027 | 0.004 | 0.017 |
| Education_Level | -0.002 | 0.000 | 1.000 | -0.013 | 0.006 | 0.002 | 0.007 | -0.003 | -0.004 | -0.004 | -0.003 | -0.010 | -0.004 | -0.002 | -0.012 | 0.002 | 1.000 | 0.001 | -0.006 | 0.016 | -0.015 | -0.005 | 0.003 | 0.012 | 0.013 |
| Income_Category | 0.024 | 0.066 | -0.013 | 1.000 | 0.022 | -0.003 | -0.016 | 0.023 | 0.476 | 0.035 | 0.473 | 0.011 | 0.020 | -0.055 | -0.013 | -0.246 | -0.013 | 0.014 | 0.787 | 0.024 | -0.025 | 0.009 | 0.046 | 0.006 | 0.067 |
| Months_on_book | 0.789 | -0.103 | 0.006 | 0.022 | 1.000 | -0.009 | 0.074 | -0.011 | 0.008 | 0.009 | 0.007 | -0.049 | -0.039 | -0.050 | -0.014 | -0.008 | 0.006 | -0.014 | -0.007 | 0.033 | -0.005 | -0.026 | -0.005 | 0.002 | -0.014 |
| Total_Relationship_Count | -0.011 | -0.039 | 0.002 | -0.003 | -0.009 | 1.000 | -0.004 | 0.055 | -0.071 | 0.014 | -0.073 | 0.050 | -0.347 | -0.242 | 0.041 | 0.068 | 0.002 | 0.150 | 0.003 | 0.017 | -0.017 | -0.009 | -0.056 | -0.043 | -0.061 |
| Months_Inactive_12_mon | 0.054 | -0.011 | 0.007 | -0.016 | 0.074 | -0.004 | 1.000 | 0.029 | -0.020 | -0.042 | -0.017 | -0.032 | -0.037 | -0.043 | -0.039 | -0.008 | 0.007 | -0.152 | -0.011 | -0.007 | 0.009 | -0.005 | -0.003 | -0.004 | -0.016 |
| Contacts_Count_12_mon | -0.018 | -0.041 | -0.003 | 0.023 | -0.011 | 0.055 | 0.029 | 1.000 | 0.021 | -0.054 | 0.026 | -0.024 | -0.113 | -0.152 | -0.095 | -0.055 | -0.003 | -0.204 | 0.040 | 0.002 | 0.008 | -0.010 | -0.001 | 0.002 | -0.001 |
| Credit_Limit | 0.002 | 0.068 | -0.004 | 0.476 | 0.008 | -0.071 | -0.020 | 0.021 | 1.000 | 0.042 | 0.996 | 0.013 | 0.172 | 0.076 | -0.002 | -0.483 | -0.004 | 0.024 | 0.421 | -0.057 | 0.032 | 0.025 | 0.234 | 0.106 | 0.441 |
| Total_Revolving_Bal | 0.015 | -0.003 | -0.004 | 0.035 | 0.009 | 0.014 | -0.042 | -0.054 | 0.042 | 1.000 | -0.047 | 0.058 | 0.064 | 0.056 | 0.090 | 0.624 | -0.004 | 0.263 | 0.030 | 0.039 | -0.038 | -0.002 | 0.024 | 0.006 | 0.013 |
| Avg_Open_To_Buy | 0.001 | 0.068 | -0.003 | 0.473 | 0.007 | -0.073 | -0.017 | 0.026 | 0.996 | -0.047 | 1.000 | 0.008 | 0.166 | 0.071 | -0.010 | -0.539 | -0.003 | 0.000 | 0.418 | -0.060 | 0.036 | 0.025 | 0.232 | 0.105 | 0.440 |
| Total_Amt_Chng_Q4_Q1 | -0.062 | -0.035 | -0.010 | 0.011 | -0.049 | 0.050 | -0.032 | -0.024 | 0.013 | 0.058 | 0.008 | 1.000 | 0.040 | 0.005 | 0.384 | 0.035 | -0.010 | 0.131 | 0.027 | 0.052 | -0.043 | -0.013 | 0.006 | 0.004 | 0.003 |
| Total_Trans_Amt | -0.046 | 0.025 | -0.004 | 0.020 | -0.039 | -0.347 | -0.037 | -0.113 | 0.172 | 0.064 | 0.166 | 0.040 | 1.000 | 0.807 | 0.086 | -0.083 | -0.004 | 0.169 | 0.025 | -0.063 | 0.045 | 0.026 | 0.104 | 0.060 | 0.155 |
| Total_Trans_Ct | -0.067 | 0.050 | -0.002 | -0.055 | -0.050 | -0.242 | -0.043 | -0.152 | 0.076 | 0.056 | 0.071 | 0.005 | 0.807 | 1.000 | 0.112 | 0.003 | -0.002 | 0.371 | -0.067 | -0.124 | 0.099 | 0.028 | 0.076 | 0.042 | 0.101 |
| Total_Ct_Chng_Q4_Q1 | -0.012 | 0.011 | -0.012 | -0.013 | -0.014 | 0.041 | -0.039 | -0.095 | -0.002 | 0.090 | -0.010 | 0.384 | 0.086 | 0.112 | 1.000 | 0.074 | -0.012 | 0.290 | -0.006 | 0.014 | -0.010 | 0.004 | -0.003 | -0.009 | -0.003 |
| Avg_Utilization_Ratio | 0.007 | -0.037 | 0.002 | -0.246 | -0.008 | 0.068 | -0.008 | -0.055 | -0.483 | 0.624 | -0.539 | 0.035 | -0.083 | 0.003 | 0.074 | 1.000 | 0.002 | 0.178 | -0.258 | 0.050 | -0.030 | -0.019 | -0.085 | -0.037 | -0.190 |
| Education_Level_Mapped | -0.002 | 0.000 | 1.000 | -0.013 | 0.006 | 0.002 | 0.007 | -0.003 | -0.004 | -0.004 | -0.003 | -0.010 | -0.004 | -0.002 | -0.012 | 0.002 | 1.000 | 0.001 | -0.006 | 0.016 | -0.015 | -0.005 | 0.003 | 0.012 | 0.013 |
| Attrition_Flag_Existing Customer | -0.018 | -0.019 | 0.001 | 0.014 | -0.014 | 0.150 | -0.152 | -0.204 | 0.024 | 0.263 | 0.000 | 0.131 | 0.169 | 0.371 | 0.290 | 0.178 | 0.001 | 1.000 | 0.037 | 0.024 | -0.019 | -0.009 | -0.006 | -0.011 | 0.008 |
| Gender_M | -0.017 | 0.005 | -0.006 | 0.787 | -0.007 | 0.003 | -0.011 | 0.040 | 0.421 | 0.030 | 0.418 | 0.027 | 0.025 | -0.067 | -0.006 | -0.258 | -0.006 | 0.037 | 1.000 | 0.011 | -0.016 | 0.012 | 0.043 | 0.007 | 0.073 |
| Marital_Status_Married | 0.047 | 0.014 | 0.016 | 0.024 | 0.033 | 0.017 | -0.007 | 0.002 | -0.057 | 0.039 | -0.060 | 0.052 | -0.063 | -0.124 | 0.014 | 0.050 | 0.016 | 0.024 | 0.011 | 1.000 | -0.741 | -0.262 | -0.024 | -0.010 | -0.044 |
| Marital_Status_Single | -0.011 | -0.041 | -0.015 | -0.025 | -0.005 | -0.017 | 0.009 | 0.008 | 0.032 | -0.038 | 0.036 | -0.043 | 0.045 | 0.099 | -0.010 | -0.030 | -0.015 | -0.019 | -0.016 | -0.741 | 1.000 | -0.226 | 0.024 | 0.010 | 0.031 |
| Marital_Status_Unknown | -0.027 | 0.042 | -0.005 | 0.009 | -0.026 | -0.009 | -0.005 | -0.010 | 0.025 | -0.002 | 0.025 | -0.013 | 0.026 | 0.028 | 0.004 | -0.019 | -0.005 | -0.009 | 0.012 | -0.262 | -0.226 | 1.000 | 0.012 | 0.004 | 0.018 |
| Card_Category_Gold | -0.012 | 0.027 | 0.003 | 0.046 | -0.005 | -0.056 | -0.003 | -0.001 | 0.234 | 0.024 | 0.232 | 0.006 | 0.104 | 0.076 | -0.003 | -0.085 | 0.003 | -0.006 | 0.043 | -0.024 | 0.024 | 0.012 | 1.000 | -0.005 | -0.026 |
| Card_Category_Platinum | 0.007 | 0.004 | 0.012 | 0.006 | 0.002 | -0.043 | -0.004 | 0.002 | 0.106 | 0.006 | 0.105 | 0.004 | 0.060 | 0.042 | -0.009 | -0.037 | 0.012 | -0.011 | 0.007 | -0.010 | 0.010 | 0.004 | -0.005 | 1.000 | -0.011 |
| Card_Category_Silver | -0.019 | 0.017 | 0.013 | 0.067 | -0.014 | -0.061 | -0.016 | -0.001 | 0.441 | 0.013 | 0.440 | 0.003 | 0.155 | 0.101 | -0.003 | -0.190 | 0.013 | 0.008 | 0.073 | -0.044 | 0.031 | 0.018 | -0.026 | -0.011 | 1.000 |
# use correlation test
# defining the size of the plot
num_features = [
'Customer_Age', 'Dependent_count', 'Education_Level', 'Income_Category',
'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon',
'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
'Education_Level_Mapped', 'Attrition_Flag_Existing Customer'
]
data_subset = df[num_features]
# Create the correlation matrix for the selected features or the entire dataset
corr_matrix = data_subset.corr()
# Define the size of the plot
plt.figure(figsize=(12, 7))
# Plot the heatmap for the correlation matrix
sns.heatmap(
corr_matrix, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
# Show the plot
plt.show()
To focus on the variables that are most related, I ranked the most correlated pairs.
# find correlation
corr_matrix = df.corr()
# correlation pairs
corr_pairs = corr_matrix.unstack().sort_values(kind="quicksort")
corr_pairs = corr_pairs[corr_pairs != 1]
# sort pairs by absolute value; each pair appears twice, so head(10) yields the top 5 unique pairs
top_corr_pairs = corr_pairs.abs().sort_values(ascending=False).head(10)
top_corr_pairs
| Variable 1 | Variable 2 | abs. correlation |
|---|---|---|
| Credit_Limit | Avg_Open_To_Buy | 0.996 |
| Avg_Open_To_Buy | Credit_Limit | 0.996 |
| Total_Trans_Ct | Total_Trans_Amt | 0.807 |
| Total_Trans_Amt | Total_Trans_Ct | 0.807 |
| Customer_Age | Months_on_book | 0.789 |
| Months_on_book | Customer_Age | 0.789 |
| Gender_M | Income_Category | 0.787 |
| Income_Category | Gender_M | 0.787 |
| Marital_Status_Married | Marital_Status_Single | 0.741 |
| Marital_Status_Single | Marital_Status_Married | 0.741 |
Avg_Open_To_Buy and Credit_Limit are the most correlated pair (0.996), followed by Total_Trans_Amt and Total_Trans_Ct (0.807), Customer_Age and Months_on_book (0.789), and Gender_M and Income_Category (0.787). Marital_Status_Married and Marital_Status_Single are also strongly (negatively) correlated, at 0.741 in absolute value.
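Because `corr().unstack()` lists every pair twice, a common alternative keeps only the upper triangle of the correlation matrix so each pair appears once; a sketch on a toy frame:

```python
import numpy as np
import pandas as pd

def top_corr_pairs(data, k=5):
    """Top-k most correlated column pairs (by absolute value), each listed once."""
    corr = data.corr()
    # Mask everything on or below the diagonal, then rank the remaining pairs
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return upper.unstack().dropna().abs().sort_values(ascending=False).head(k)

# Toy frame: 'a' and 'b' are nearly identical, 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.1, size=200),
                    'c': rng.normal(size=200)})
top = top_corr_pairs(toy, k=2)
print(top)
```

Note that taking absolute values before sorting hides the sign, which is why the strong negative Married/Single correlation shows up alongside the positive pairs.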
Here is a breakdown of all the variables
# Pair plot
sns.pairplot(data=data_subset, hue="Attrition_Flag_Existing Customer")
plt.show()